# Abstract

In many applications one would like to use information from both
color and texture features in order to segment an image. We propose
a novel technique to combine “soft” segmentations computed for two or
more features independently. Our algorithm merges models according to
a mean entropy criterion, and allows one to choose the appropriate
number of classes for the final grouping. This technique can also
improve the quality of supervised classification based on one feature
(e.g., color) by merging information from unsupervised segmentation
based on another feature (e.g., texture).

1 Introduction 

Image segmentation is a fundamental task in Computer Vision. Color
and texture provide powerful cues for segmenting a still image, and
much work has been devoted to developing grouping algorithms based on
these two features [1], [3], [5]. In fact, most of the literature
deals with segmentation based on either color or texture; this work
originated from the intuition that, by using the information provided
by both features, one should be able to obtain more robust and
meaningful results.

Underlying our approach is the hypothesis that in typical images
color and texture features are not statistically independent. Perhaps
the simplest way to exploit this dependency is to concatenate the
color and texture feature vectors, and then run the grouping algorithm
of choice on these super-vectors. This approach, however, may give the
feeling of “comparing apples with oranges”. Indeed, color and texture
features often have very different statistical behaviors; one may
prefer to use the most suitable grouping algorithm for each feature
separately, and then somehow combine the results of the two
segmentations.

This work introduces a strategy to merge, in a Bayesian framework,
segmentations computed on color and texture features independently.
The only requirement is that the segmentations are expressed in terms
of posterior probabilities [2]. Note that most clustering algorithms
based on mixture models explicitly compute estimates of the posterior
distributions, and do the final assignment by Bayesian classification
(i.e., they assign a feature to the component of the mixture model
that most likely generated that feature).

For example, in Figure 2(b) and (c) we show instances of color and
texture segmentation of the image in Figure 2(a). The texture features
are formed by the absolute values of the outputs of a bank of Gabor
filters, smoothed by a Gaussian kernel to enforce spatial coherence
[3]. The mixture model in both cases has been estimated by Expectation
Maximization [2]; the “hard” segmentation shown in the figures is the
result of Bayesian classification based on such mixture models. Both
mixture models have four classes, although our algorithm accepts
models with any number of classes. The scene in figure 2(a) is
composed of a small number of homogeneous parts: two bushes, a paved
road on the right, dirt soil on the left, a shadow area near a bush
and a piece of dark background. The color segmenter (figure 2(b))
successfully separates the “bush”, the “background” and the “road”
areas, but is unable to discriminate the “road” from the “soil” parts,
which have very similar color. The texture segmenter does separate the
“road” and “soil” areas, but cannot discriminate the “road” from the
“background” parts; in addition, it assigns the “soil” area to two
distinct classes of the mixture model.

Our technique for model fusion involves two steps. First, the two
models are merged by a “Cartesian product” operator, discussed in
section 2. This operation preserves all the information about the
models, but has the disadvantage of creating a large number of
classes, equal to the product of the number of classes of the two
original models. Then, the number of classes of the combined model is
reduced by a technique, presented in section 3, that “clips together”
sets of classes. Such classes are selected on the basis of a mean
entropy criterion that minimizes the loss of “descriptiveness”; the
mean entropy criterion also provides useful information for choosing
the appropriate number of classes for the final model. An intriguing
application of our algorithm is discussed in section 4, and involves
information fusion from supervised classification (e.g., based on
color) and unsupervised segmentation (e.g., based on texture). The
unsupervised segmentation is used to support the estimates provided by
the trained model, resulting in a more accurate classification.

2 Cartesian product of mixture models
Our merging technique starts from $K$ given mixture models [2]
(called “models” in the following). The $i$-th model, $\mathcal{M}_i$,
is composed of $N_i$ classes, and defines a probability density
function $p_i(z_i)$:

$$p_i(z_i) = \sum_{j=1}^{N_i} p_i(z_i|j)\, P_i(j) \qquad (1)$$

where $z_i$, the observed feature, lives in a space $Z_i$. For
example, $z_i$ may be a color vector, or a texture feature in a
multiscale/multiorientation space. The conditional likelihood
functions $p_i(z_i|j)$ and the priors $P_i(j)$ specify the model
completely. The posterior distributions are given by Bayes' rule:

$$P_i(j|z_i) = \frac{p_i(z_i|j)\, P_i(j)}{p_i(z_i)} \qquad (2)$$

$P_i(j|z_i)$ is the probability that the observed feature $z_i$ was
generated by the class of index $j$. The Bayesian classifier for
$\mathcal{M}_i$ assigns a feature $z_i$ to the class indexed by the
location of the maximum of $P_i(j|z_i)$. To simplify our presentation,
we will assume in the following that all the priors are strictly
positive: if a prior $P_i(j)$ is null, we can safely remove the class
with index $j$ from the model.

The Cartesian product $\mathcal{M}$ of the models $\mathcal{M}_i$ is a
new model with probability distribution over
$Z_1 \times \dots \times Z_K$. $\mathcal{M}$ is completely specified
by the following axioms:

1. $\mathcal{M}$ has $N = \prod_{i=1}^{K} N_i$ classes, corresponding
   to the Cartesian product of the classes of the models
   $\mathcal{M}_i$: $j \leftrightarrow (j_1, \dots, j_K)$.

2. The conditional likelihood of the feature $z = (z_1, \dots, z_K)$
   given the class of index $j$ is equal to
   $p(z|j) = \prod_{i=1}^{K} p_i(z_i|j_i)$.

3. The priors factorize as $P(j) = \prod_{i=1}^{K} P_i(j_i)$.

It follows straightforwardly that the likelihood and the posteriors
of the Cartesian product of models factorize as well:

$$p(z) = \prod_{i=1}^{K} p_i(z_i), \qquad P(j|z) = \prod_{i=1}^{K} P_i(j_i|z_i) \qquad (3)$$

Note that all the information about the $K$ original models is
preserved in their Cartesian product $\mathcal{M}$. The Bayesian
classifier for $\mathcal{M}$ assigns a feature $z$ to the class
$j \leftrightarrow (j_1, \dots, j_K)$ such that $j_i$ is the class
assigned to $z_i$ by the Bayesian classifier for $\mathcal{M}_i$.
Figure 2(d) shows the Bayesian segmentation relative to the Cartesian
product of the color and texture models of figure 2(b) and (c). The
new model has 16 classes. In the next section we describe a procedure
to reduce the dimensionality (i.e., the number of classes) of a model,
in such a way that the loss of “descriptiveness” of the model is
minimized.
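
The factorization of the posteriors can be illustrated with a minimal NumPy sketch for two models. The function name, the array layout (one row of posteriors per pixel) and the toy numbers are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def cartesian_product_posteriors(post_a, post_b):
    """Combine per-pixel posteriors of two models (hypothetical helper).

    post_a: array (n_pixels, Na), each row sums to 1
    post_b: array (n_pixels, Nb), each row sums to 1
    Returns an array (n_pixels, Na * Nb); column a * Nb + b holds
    P_a(a | z_1) * P_b(b | z_2), i.e. the product of the posteriors.
    """
    n = post_a.shape[0]
    joint = post_a[:, :, None] * post_b[:, None, :]
    return joint.reshape(n, -1)

# Toy example with made-up numbers: 3 pixels, a 2-class "color" model
# and a 3-class "texture" model.
color = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
texture = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]])

joint = cartesian_product_posteriors(color, texture)
labels = joint.argmax(axis=1)  # Bayesian classification in the product model
```

In this sketch the label of each pixel in the product model is simply the pair of labels chosen by the two original classifiers, encoded as a single index.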

3 Dimensionality reduction 

Assume we are given a model $\mathcal{M}$ with $N$ classes. We
introduce here a technique to build a new model that has fewer classes
than $\mathcal{M}$ but explains the data exactly as $\mathcal{M}$,
i.e., it defines the same likelihood $p(z)$ as $\mathcal{M}$. Suppose
for example that we want to reduce the dimensionality of the model to
$N - M$. Our strategy is very simple: we just “clip together” $M + 1$
classes of $\mathcal{M}$ into a new super-class, leaving the other
classes untouched. We may decide, for instance, to clip together the
classes of index $N - M, \dots, N$. The probability that a feature $z$
was generated by the union of such classes according to $\mathcal{M}$
is equal to the sum of the corresponding posteriors. This is the value
that we assign to the posterior $P^{new}(N - M|z)$ for the new model;
the posteriors for the other classes are the same as in $\mathcal{M}$:

$$P^{new}(j|z) = P(j|z), \quad 1 \le j < N - M$$

$$P^{new}(N - M|z) = \sum_{j=N-M}^{N} P(j|z)$$

If in addition we impose that the likelihood function $p(z)$ is the
same in both models, the new model is completely specified.

In general, to reduce the model dimension from $N$ to $N - M$, we may
choose any $L \le M$ disjoint groups of classes with $L_i$ components
each, such that $\sum_{i=1}^{L} L_i = L + M$, and clip together the
classes in each group. A criterion for the selection of the most
appropriate clipping scheme is presented in the next section.
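
Class-clipping acts on the posteriors only, so it can be sketched in a few lines of NumPy. The helper name `clip_classes`, the 0-based indices and the toy posteriors below are assumptions for illustration.

```python
import numpy as np

def clip_classes(post, groups):
    """Merge groups of classes by summing their posteriors.

    post:   array (n_pixels, N) of posteriors P(j | z)
    groups: list of disjoint lists of class indices (0-based)
    Classes not listed in any group are kept unchanged and come first;
    one new column per group is appended after them.
    """
    grouped = {j for g in groups for j in g}
    keep = [j for j in range(post.shape[1]) if j not in grouped]
    merged = [post[:, g].sum(axis=1) for g in groups]
    return np.column_stack([post[:, keep]] + merged)

# Made-up posteriors for 2 pixels and N = 4 classes.
post = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.7, 0.1, 0.1, 0.1]])

# Clip the last two classes together: N = 4 -> N - M = 3 with M = 1.
reduced = clip_classes(post, [[2, 3]])
```

Each row of `reduced` still sums to one, which reflects the fact that the likelihood $p(z)$ is left unchanged by the operation.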



3.1 Mean entropy criterion 

Dimensionality reduction via class-clipping involves some loss of
“descriptiveness” of the model. If for example two classes that
“explain” well two different portions of the image are clipped
together, the new model will probably assign those two portions of the
image to the same class. This observation suggests the criterion for
selecting a clipping scheme introduced in this section. Our criterion
is based on the notion of mean entropy, a well-known concept in the
fields of statistical physics and mixture estimation [4], [6].

Given a feature $z$, the entropy of the posterior distribution
$P(j|z)$ is defined by [2]

$$s(z) = -\sum_{j=1}^{N} P(j|z) \log P(j|z) \qquad (4)$$

The entropy $s(z)$ measures the softness of the class assignment. A
distribution with null entropy assigns $z$ to exactly one class; the
maximum value of the entropy is $\log N$, and is attained if all
classes are equally likely to have generated $z$.

The mean entropy $S$ of a model is defined by the expectation of
$s(z)$ with respect to the “real” distribution of $z$. In practice,
the mean entropy can be estimated by averaging $s(z)$ over the
observed image. A model with null mean entropy can only perform “hard”
classification, and will be called degenerate. Note that the mean
entropy of a model estimated via Expectation Maximization is a
function of the “temperature” of the algorithm [6].
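
As a minimal sketch, the entropy of the posteriors and its image average can be computed as follows. The function names and the convention $0 \log 0 = 0$ are assumptions of this example.

```python
import numpy as np

def posterior_entropy(post):
    """Entropy s(z) of each row of posteriors, taking 0 log 0 as 0."""
    safe = np.where(post > 0, post, 1.0)  # log(1) = 0 removes the zero terms
    return -(post * np.log(safe)).sum(axis=1)

def mean_entropy(post):
    """Estimate of the mean entropy: average of s(z) over the pixels."""
    return posterior_entropy(post).mean()

# Two extreme cases for N = 4 classes.
hard = np.array([[1.0, 0.0, 0.0, 0.0]])  # assigned to exactly one class
soft = np.full((1, 4), 0.25)             # all classes equally likely
```

The "hard" row has null entropy, while the uniform row attains the maximum value $\log N$.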

The following result, whose proof is in the Ap- 
pendix, characterizes the relation between mean en- 
tropy and dimensionality reduction. 

Fact 1 Class-clipping never increases the mean en- 
tropy of a model. 

In general, if a new super-class “takes over” two dif- 
ferent portions of the image that the previous model 
assigns to two classes separately, a significant decrease 
of the mean entropy is expected. Hence, a suitable 
criterion for dimensionality reduction is the following 
one: choose the clipping scheme that minimizes the 
decrement of the mean entropy. 

Unfortunately, the number of possible clipping schemes may be very
high even for small model dimension. For example, in order to reduce
the number of classes from 16 to 13 we may choose among 45,500
different combinations of class clipping. Measuring the decrease of
mean entropy for each one of those schemes may require a prohibitive
computational cost.
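
The figure quoted above can be checked by counting schemes by group size. The sketch below assumes that the count refers to schemes with one or two groups (a single group of four classes, or a group of three plus a disjoint pair); schemes made of three disjoint pairs also reduce 16 classes to 13 and are tallied separately.

```python
from math import comb

N = 16  # classes in the Cartesian product of the example

# One super-class absorbing four classes.
one_group_of_four = comb(N, 4)

# One group of three classes plus one disjoint pair.
triple_and_pair = comb(N, 3) * comb(N - 3, 2)

# Three disjoint, unordered pairs (listed separately from the total above).
three_pairs = comb(N, 2) * comb(N - 2, 2) * comb(N - 4, 2) // 6

print(one_group_of_four + triple_and_pair)  # 45500
print(three_pairs)                          # 120120
```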

A suboptimal solution can be found using a fast greedy algorithm that
builds a sequence of clippings involving only two classes at a time.
At each step, the two classes that minimize the decrease of the mean
entropy are selected. The algorithm is described in detail in
figure 1.

Greedy algorithm for dimensionality reduction: $N \to N - M$

Given the set of posteriors $P(j|z)$, $1 \le j \le N$:

Build auxiliary vector $R$ and matrix $D$:

  $R(j) = E[-P(j|z) \log P(j|z)]$, $1 \le j \le N$;

  $D(j,k) = R(j) + R(k) + E[(P(j|z) + P(k|z)) \log(P(j|z) + P(k|z))]$
  for $1 \le k < j \le N$, and $D(j,k) = \infty$ otherwise.

Initialize an empty list $L$;

Repeat $M$ times:

  $(j,k) = \arg\min D(j,k)$;
  Add $k$ to the list $L$;
  Update $P(j|z) \leftarrow P(j|z) + P(k|z)$, $P(k|z) \leftarrow 0$;
  Update $R(j)$;
  Update $D(j,l)$ for $l < j$, $l \notin L$;
  Update $D(l,j)$ for $l > j$, $l \notin L$;
  Set $D(l,k) = \infty$ for $l > k$;
  Set $D(k,l) = \infty$ for $l < k$;

Remove the classes indexed by the elements of $L$.

Figure 1: The greedy algorithm to select a class-clipping scheme (see
section 3.1).

Figure 2(g) shows an example of Bayesian classification after
dimensionality reduction from 16 to 4 classes, based on the mean
entropy criterion. Each class of the reduced dimension model now
represents a characteristic area of the image. The computation of the
clipping scheme, by a Matlab implementation of our greedy algorithm,
requires about 50 seconds of computation time on a Power Mac G3
266 MHz (the image size is 256 x 380 pixels).

3.2 Dimension selection 

Mean entropy can also be used as an indicator to determine an
appropriate number of classes for the reduced dimensionality model. In
figure 2(k) we plot the mean entropy as a function of the number of
classes for our example. Note that the algorithm for the greedy
selection of classes, which reduces the dimension by one at a time,
allows us to easily compute these values as a by-product. It is
interesting to note that the mean entropy does not decrease uniformly
as the dimension is reduced; in fact, a number of “phase transitions”
are observed, corresponding to a few “representative” dimensions. As
noted above, we may expect the result of the segmentation to change
dramatically in correspondence of an abrupt decrease of the mean
entropy. For example, figure 2(e) and (f) show results corresponding
to dimension 10 and 6 respectively. The two segmentations look very
similar; indeed, the mean entropy in the two cases is almost the same.
However, if we reduce the number of classes further, the mean entropy
changes abruptly: this phenomenon explains the strong difference
between the segmentations of figure 2(f) (6 classes) and (g)
(4 classes). The mean entropy undergoes another large decrement if we
reduce the dimension from 4 to 3: as shown in figure 2(h), this is due
to the fact that the class representing the “soil” and the class
representing the “bushes” have been clipped together.
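
One simple way to locate such abrupt decreases automatically is to flag the steps whose entropy loss is comparable to the largest one. The rule below, its `ratio` parameter and the entropy values are hypothetical and only meant to illustrate the idea; they are not part of the method described above.

```python
import numpy as np

def large_drops(entropies, n_start, ratio=0.5):
    """List the dimensions reached through a large mean-entropy drop.

    entropies: mean entropy for n_start, n_start - 1, ... classes
    ratio:     a drop is "large" if it is at least ratio * (largest drop);
               this threshold is an assumption of the sketch
    Returns (dimension reached, entropy lost) pairs.
    """
    ent = np.asarray(entropies, dtype=float)
    drops = ent[:-1] - ent[1:]  # loss when going from n to n - 1 classes
    dims = n_start - 1 - np.arange(len(drops))
    threshold = ratio * drops.max()
    return [(int(d), float(x)) for d, x in zip(dims, drops) if x >= threshold]

# Made-up entropy curve for models with 8, 7, ..., 3 classes.
curve = [1.00, 0.99, 0.98, 0.60, 0.58, 0.20]
flagged = large_drops(curve, n_start=8)
print([d for d, _ in flagged])
```

With this rule, a "representative" dimension is the one just before each flagged step, since clipping one more class there would change the segmentation substantially.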

3.3 Equalization 

In the previous sections we have described a strategy for model
fusion that first builds the Cartesian product of two models, and then
performs dimensionality reduction via class-clipping. An implicit
assumption was that the two original models give the same contribution
to the final segmentation. This hypothesis does not hold true if the
two models have very different values of the mean entropy. In this
case, the model with the smallest entropy “dominates” the combined
model.

We propose a simple equalization procedure that makes it possible to
merge two models with different mean entropies; the procedure can be
applied to either one of the models. The equalization operator starts
from a model $\mathcal{M}$ and produces a new model with the same
number of classes $N$. The entropy of this new model can be tuned to
match any desired value $S_0 < \log N$, and the associated Bayesian
classifier yields the same results as the Bayesian classifier for
$\mathcal{M}$.

The equalization operator simply replaces each posterior distribution
$P(j|z)$ with the new distribution $P^{eq}(j|z)$, defined as follows:

$$P^{eq}(j|z) = c(z)\, P(j|z)^{\alpha}, \quad \alpha > 0 \qquad (5)$$

where $c(z)$ is a normalizing coefficient:

$$c(z) = \frac{1}{\sum_{j=1}^{N} P(j|z)^{\alpha}} \qquad (6)$$

The mean entropy properties of the equalization op- 
erator are summarized by the following result: 


Fact 2 Equalization decreases the mean entropy of a non-degenerate
model if $\alpha > 1$, and increases it if $\alpha < 1$.


Figure 2: (a) Test image. (b) Color-based segmentation (4 classes).
(c) Texture-based segmentation (4 classes). (d) Segmentation after
Cartesian product (16 classes). (e)-(h) Segmentation after model
merging ((e): 10 classes, (f): 6 classes, (g): 4 classes, (h): 3
classes). (i),(j) Segmentation after model merging (4 classes), with
mean entropy of the color-based model set to 10 times smaller (i) or
10 times larger (j) than that of the texture-based model. (k) Mean
entropy as a function of model dimension.





Figure 3: (a) Test image. (b) Color-based supervised classification
into the “road” class (yellow) and the “grass” class (green).
(c) Texture-based unsupervised segmentation (3 classes). (d) Hybrid
supervised/unsupervised classification.

The proof can be found in the Appendix. Note that $\alpha = 0$
implies that the mean entropy of $P^{eq}(j|z)$ is equal to $\log N$;
the mean entropy of $P^{eq}(j|z)$ can be made as small as desired by a
suitably large value of $\alpha$. Also note that for each feature $z$
the location of the maximum of the posterior distribution is not
changed by the equalization.

Now, suppose that the two models to be merged have different mean
entropies. We may modify one of the models via the equalization
operator, so that its mean entropy matches the mean entropy of the
other model. The appropriate value of the parameter $\alpha$ may be
found using any non-linear one-dimensional minimization technique.
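
A minimal sketch of the equalization operator and of the search for $\alpha$ is given below. Instead of a general minimization routine it uses bisection on $\log \alpha$, which assumes that the mean entropy does not increase as $\alpha$ grows; the function names, the search interval and the test data are assumptions of this example.

```python
import numpy as np

def equalize(post, alpha):
    """Raise posteriors to the power alpha and renormalize each row.

    Rows are divided by their maximum first; this leaves the result
    unchanged but avoids underflow for large alpha.
    """
    scaled = post / post.max(axis=1, keepdims=True)
    powered = scaled ** alpha
    return powered / powered.sum(axis=1, keepdims=True)

def mean_entropy(post):
    """Average entropy of the rows of post, taking 0 log 0 as 0."""
    safe = np.where(post > 0, post, 1.0)
    return -(post * np.log(safe)).sum(axis=1).mean()

def match_entropy(post, target, lo=1e-3, hi=1e3, iters=60):
    """Find alpha whose equalized mean entropy is close to target."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # midpoint on a logarithmic scale
        if mean_entropy(equalize(post, mid)) > target:
            lo = mid            # still too soft: a larger alpha is needed
        else:
            hi = mid
    return np.sqrt(lo * hi)

# Made-up posteriors; halve their mean entropy.
rng = np.random.default_rng(1)
raw = rng.random((200, 4))
toy = raw / raw.sum(axis=1, keepdims=True)
target = 0.5 * mean_entropy(toy)
alpha = match_entropy(toy, target)
eq = equalize(toy, alpha)
```

In this example the target is below the original mean entropy, so the value of `alpha` found is larger than one, and the hard classification of every row is unchanged.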

In the example of figure 2, the mean entropies of the color and
texture models were very similar, and equalization was not needed.
Equalization, however, may be used to make either of the models
dominant. For example, figure 2(i) and (j) show the results of
Bayesian segmentation after equalizing the color-based model to a
value of the mean entropy respectively 10 times smaller and 10 times
larger than the mean entropy of the texture-based model (the combined
model dimension was reduced to 4 by class-clipping). This example
shows that equalization is a practical and simple method for assigning
different “weights” to the two models to be merged.

4 Hybrid classification 

The main differences between supervised classification and
unsupervised clustering are:

1. The classes (“labels”) of a supervised classifier usually
   represent “physical” causes, and therefore are not logically
   interchangeable;

2. The statistical model is learned from training data.

The Bayesian classifier assigns a feature $z$ to the maximizer of the
posterior [2]. In many instances, only the conditional likelihoods
$p(z|j)$ are learned; however, reasonable assumptions about the class
priors $P(j)$ are often available, and the posteriors can be computed
using Bayes' rule.

In this section we propose to merge a model $\mathcal{M}^s$ for
supervised classification with a model $\mathcal{M}^u$ for
unsupervised segmentation (based on a different feature space), to
create a “hybrid” classifier which assigns each image point to some
label of $\mathcal{M}^s$. The intuition is that information from the
“unsupervised model” (which identifies clusters in the feature space
based on the current image) may be used to improve the classification
performed by the “supervised model”, which is learned from a large
training data set and may not be optimal for the current instance.

The merging algorithm discussed in the previous sections defines a
model $\mathcal{M}$ with classes that are the union of Cartesian
products of classes from $\mathcal{M}^s$ and $\mathcal{M}^u$. If $C$
represents a generic class of $\mathcal{M}$, we may write

$$C = \bigcup_{v \in V} \bigcup_{w(v)} \left( C^s_v \times C^u_{w(v)} \right) \qquad (7)$$

where $C^s$ and $C^u$ are classes of $\mathcal{M}^s$ and
$\mathcal{M}^u$ respectively, indexed by the corresponding subscripts.
To complete the definition of the hybrid classification model, we need
to identify each class $C$ with some class $C^s_v$ of $\mathcal{M}^s$.
If the set $V$ of classes of $\mathcal{M}^s$ that form the super-class
$C$ is composed of just one element $v$, then we simply identify $C$
with $C^s_v$. In general, however, $V$ may have more than one element;
in this case, we identify $C$ with the class $C^s_v$ that maximizes
the contribution to $C$, defined by

$$E\Big[ P\Big( \bigcup_{w(v)} \big( C^s_v \times C^u_{w(v)} \big) \Big| z \Big) \Big] = E\Big[ P^s(v|z_1) \sum_{w(v)} P^u(w(v)|z_2) \Big] \qquad (8)$$

where $E[\cdot]$ represents the expectation operator, computed with
respect to the “real” distribution of $z = (z_1, z_2)$, and
$P^s(\cdot|\cdot)$, $P^u(\cdot|\cdot)$ and $P(\cdot|\cdot)$ represent
the posteriors of the models $\mathcal{M}^s$, $\mathcal{M}^u$ and
$\mathcal{M}$ respectively.
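
The labeling rule can be sketched as follows, with the expectation estimated by an average over the image. The function name, the representation of a merged class as a list of (supervised, unsupervised) index pairs and the toy posteriors are assumptions made for illustration.

```python
import numpy as np

def label_merged_class(post_s, post_u, pairs):
    """Pick the supervised label for one merged class C.

    post_s: (n_pixels, Ns) posteriors of the supervised model
    post_u: (n_pixels, Nu) posteriors of the unsupervised model
    pairs:  list of (v, w) index pairs whose product classes form C
    Returns the supervised index v with the largest contribution to C.
    """
    contrib = {}
    for v in sorted({v for v, _ in pairs}):
        ws = [w for vv, w in pairs if vv == v]
        # Image average of P^s(v | z_1) * sum over w of P^u(w | z_2).
        contrib[v] = (post_s[:, v] * post_u[:, ws].sum(axis=1)).mean()
    return max(contrib, key=contrib.get)

# Made-up posteriors for 4 pixels; supervised classes: 0 and 1.
post_s = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
post_u = np.array([[0.9, 0.1], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])

# Two merged classes, each mixing both supervised labels.
label_a = label_merged_class(post_s, post_u, [(0, 0), (1, 0)])
label_b = label_merged_class(post_s, post_u, [(0, 1), (1, 1)])
```

Here the first merged class is dominated by pixels that the supervised model attributes to class 0, and the second by pixels attributed to class 1, so the two merged classes receive different labels.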

We present an example of hybrid classification in Figure 3.
Figure 3(a) shows a scene with a dirt road on the left and dry grass
on the right. Supervised color-based classification (figure 3(b)) is
performed using a trained Gaussian model. The “road” class and the
“grass” class have very similar colors; this is the reason why pixels
in the top-right quadrant are misclassified as belonging to the “road”
class. Figure 3(c) shows the results of unsupervised texture
segmentation with three classes, computed via Expectation
Maximization. The segmenter isolates uniform regions corresponding to
the road and to the grass areas, plus a region corresponding to the
border of the road. The two models are merged into a new model with
four classes; the final hybrid classification is shown in
Figure 3(d). The hybrid classifier has correctly labeled each one of
the four classes of the merged model as either the “road” or the
“grass” class. The information from the texture model has helped to
correctly classify most pixels that were misclassified in
figure 3(b).


5 Conclusions 

We have presented a technique for merging two segmentations, based
on color and texture. Our technique is very general, and in principle
can also be applied to other classes of features, such as motion; it
only requires that the posterior distributions that originated the
segmentations are available. The results show the effectiveness of the
mean entropy criterion for reducing the dimensionality of the
Cartesian product of the two mixture models. We have also introduced a
technique for hybrid supervised/unsupervised classification, based on
our merging algorithm, that can improve the performance of supervised
classification using consensus from different features.

Appendix

Proof of Fact 1. A class-clipping operation can always be implemented
by a sequence of class-clippings involving two classes at a time. We
show in the following that the mean entropy can never increase with
any such step. Assume classes $j$ and $k$ are clipped together; the
variation $\Delta_s(z)$ of the entropy of the posterior distribution
$P(j|z)$ is equal to

$$\Delta_s(z) = -\big(P(j|z) + P(k|z)\big) \log\big(P(j|z) + P(k|z)\big) + P(j|z) \log P(j|z) + P(k|z) \log P(k|z) \le 0$$

Since the variation of the mean entropy is equal to the expectation
of $\Delta_s(z)$, the claim is proved.

Proof of Fact 2. We just need to prove the claim for the case
$\alpha < 1$. The proof is based on the following two results.

Lemma 1. The entropy of a probability distribution increases if two
values of the distribution are moved closer to each other, while the
other values are left untouched.

Proof. The claim is a direct consequence of the convexity of the
function $x \log x$.

Corollary 1. Let $P(j)$, $1 \le j \le N$, be a probability
distribution and, for a given $K < N$, let $J_1$ and $J_2$ be the sets
of the indices of the $K$ smallest values and of the $N - K$ largest
values of $P(j)$ respectively. Now form a new distribution
$\tilde{P}(j)$ from $P(j)$ by increasing some of the values with index
in $J_1$ while at the same time decreasing some of the values with
index in $J_2$, with the requirement that

$$\max\{\tilde{P}(j),\ j \in J_1\} \le \min\{\tilde{P}(i),\ i \in J_2\}$$

Then the entropy of $\tilde{P}(j)$ is higher than the entropy of
$P(j)$.

Proof. The transformation from $P(j)$ to $\tilde{P}(j)$ can be
decomposed into a sequence of steps, each one involving just one value
with index in $J_1$ and just one value with index in $J_2$. Therefore,
by Lemma 1, the entropy is increased at each such step.

Now, it is easy to prove that the function $c(z)\,x^{\alpha} - x$,
with $c(z)$ defined in (6), vanishes in correspondence of the value
$x = c(z)^{1/(1-\alpha)}$, which is located between the smallest and
the largest values of $P(j|z)$. Therefore, if $P(j|z)$ has non-null
entropy, the equalization operator (5) with $\alpha < 1$ falls into
the class of transformations considered in Corollary 1: the set $J_1$
is composed of all the $j$ such that $P(j|z) < c(z)^{1/(1-\alpha)}$,
the set $J_2$ is composed of all the other indices. This proves that
for any $z$ the entropy of $P(j|z)$ increases as a consequence of
equalization with $\alpha < 1$.